Abstract

The rising trend of coauthored academic works obscures the credit assignment 
that is the basis for decisions of funding and career advancements. In this 
paper, a simple model based on the assumption of an unvarying “author ability” 
is introduced. With this assumption, the weight of author contributions to a 
body of coauthored work can be statistically estimated. The method is tested
on a set of more than five hundred authors in a coauthor network from the
CiteSeerX database. The ranking obtained agrees fairly well with that given by 
total fractional citation counts for an author, but noticeable differences exist. 
1. Introduction

Typical quantitative indicators of scientific productivity and quality that
have been proposed, be it on the level of individuals, institutions or even
whole geographic regions, are, in some form or another, ultimately based on the
citation distribution to previous (and available) scientific works (in this
paper referred to as “papers” for short, covering all types: books, regular
articles, rapid communications, commentaries, proceedings, etc.). A fairly
extensive scientific literature exists on the subject of discriminating between
individuals or scientific institutions, motivated to a large extent by the
perceived need for a merit-based distribution of funding, which is scarce in
relation to the number of active scientists. Such indicators range from the
simple (counting the number of papers and/or citations) to the more elaborate,
such as the h-index (Hirsch, 2005; Jin, 2006; Hirsch, 2007; Bornmann and Daniel,
2005, 2007b; Bornmann et al., 2008) and its many variants (Egghe, 2006;
Kosmulski, 2006; Jin, 2007; Jin et al., 2007; Egghe and Rousseau, 2008;
Bras-Amoros et al., 2011; Ausloos, 2015). For a recent and in-depth review of
the fundamentals of this topic (citation counting), see the paper by Waltman
(2016). In some schools of bibliometrics, this comparison is developed further
in that the incoming citations to a paper are weighted by the importance of the
citing source. This importance can be defined, for instance, from the number of
citations the citing paper has itself received, or the number of citations of
the citing author. For a review of this topic and an empirical investigation of
its robustness, see the paper by Wang et al. (2016).

In this paper, we are motivated by the confounding factor that coauthorship
poses to any such analysis. Different options for dealing with this problem have
been proposed. The simplest is to divide the credit equally among all
contributing authors (Batista et al., 2006; Schreiber, 2008), known as either
“fractional counting” or “normalized counting”; after that comes weighting
author credit by a simple function of the author’s position in the author list
(Hagen, 2009; Sekercioglu, 2008; Zhang, 2009), or even more intricate schemes
based on this notion (Aziz and Rozing, 2013). However, these alternatives cannot
be motivated by more than “hunches” about how a particular “authorship culture”
assigns credit. Clearly, a quantitative approach is more scientific than a
qualitative, or worse, arbitrary one. Special mention is here given to the
papers by Tol (2011) and by Shen and Barabasi (2014), in which intuitive
statistical models are used to disentangle the coauthorship contributions.

Tol’s (2011) idea may be summarized as follows. Whenever two authors
write a joint paper and it is highly cited, the senior author of the pair¹
should receive a disproportionately large share of the citation credit. The
rationale for this is that it is more typical of the senior author, judging from
past experience, to write highly cited papers, and it is therefore reasonable to
assume that her contribution is more responsible for the ultimate quality. With
his method and a limited sample set comprising some fifty authors, Tol (2011)
finds small deviations of up to 25% between his “Pareto weights” and what he
terms “egalitarian weights”, in which coauthorship credit is equally
distributed.

Shen and Barabasi (2014) agree with Tol (2011) on the principle of assigning 
more credit to the “senior author”, but the algorithm to determine the actual 
credit assignment is different. To determine the “relative seniority” of each 
coauthor, their algorithm weighs both the number of papers by the author and 
the degree to which these papers share citations from papers citing the one 
under consideration. In this way, papers that are more “similar” to the one 
under consideration contribute more to the “seniority” of that coauthor when 
assigning the authorship credit. 

¹ Defined in terms of “Pareto weights”, which are directly related to the
average citations per article of an author.

The idea behind the present paper is basically the same, but the execution
is different. Rather than assume a fixed form of a distribution like Tol (2011),
we assume a fixed form for the underlying “ability” to produce said
distribution in the first place. We then solve for this “author ability”
statistically to find those authors who consistently manage to contribute to
“high-quality” papers. Another difference, which also distinguishes the method
from that by Shen and Barabasi (2014), is that a junior author is not
necessarily “punished” for publishing with a senior coauthor. If a paper is very
successful compared to previous papers on the topic, it is not altogether
unreasonable to assume that this atypical performance should be
disproportionately credited to any authors not participating in the earlier
work. However, in both Shen and Barabasi (2014) and in Tol (2011), credit is
instead disproportionately allocated to the senior author. Much like Tol (2011),
the rigorous application of our method requires knowledge of complete coauthor
networks, and can only be approximately applied otherwise. This is, however,
more of a formal problem than a practical one.

2. Regression model for coauthorship contribution 

We assume that the arbitrary author $i$ has an unchanging ability $a_i$ for
contributing to scientific papers.² A paper $\alpha$, once produced, possesses a
“scientific quality” that we non-committally denote by $q_\alpha$ for now. This
variable could be, for instance, the total number of citations or the rate of
citation accumulation, to name a few. For notational simplicity, we define the
elements, $f_{\alpha i}$, of a dimensionless “authorship tensor” $F$, to be
unity if author $i$ contributes to paper $\alpha$, and zero otherwise:

$$f_{\alpha i} = \begin{cases} 1, & \text{if } i \text{ is an author of } \alpha \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

With these definitions, we now define $a_i$ through

$$\ln q_\alpha = \sum_{i=1}^{M_a} f_{\alpha i} \ln a_i \tag{2}$$

where $M_a$ is the total number of authors in the statistical sample, formally
the number of individuals who have ever produced a work of science. In practical
calculations, we limit ourselves to much smaller subsets of authors in a
citation database. With modern computers, solving the complete system of
equations is possible if one has access to the entire database. Typically,
however, an individual can access the database only partially, through keyword
searches in an online interface, and is not allowed to download and mine the
database in its entirety (because of commercial contracts between the library
and the database provider, for instance). Such a limitation poses no greater
problem than a reduction of the underlying statistical data.
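As a minimal illustration of Eqs. (1) and (2), the following Python sketch
builds the authorship matrix from a list of author lists and evaluates the model
quality of each paper. It is not the code used for the analysis in this paper;
the function names, the toy papers and the ability values are all hypothetical.

```python
import math

def authorship_matrix(papers):
    """Return (F, authors); F[alpha][i] is 1 if author i is on paper alpha, else 0."""
    authors = sorted({name for author_list in papers for name in author_list})
    F = [[1 if name in author_list else 0 for name in authors]
         for author_list in papers]
    return F, authors

def predicted_quality(F, abilities):
    """Evaluate Eq. (2) for every paper: ln q_alpha = sum_i f_{alpha i} * ln a_i."""
    return [math.exp(sum(f * math.log(a) for f, a in zip(row, abilities)))
            for row in F]

# Hypothetical toy data: three papers, three authors, made-up abilities.
papers = [["A", "B"], ["B", "C"], ["A"]]
F, authors = authorship_matrix(papers)
q = predicted_quality(F, [2.0, 3.0, 5.0])  # close to [6.0, 15.0, 2.0]
```

Each row of `F` corresponds to one paper and each column to one author, so the
regression discussed in this section amounts to recovering the abilities from
`F` and the observed values of `q`.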

² This assumption does not contradict the statement in the Introduction that “a
senior author, judging from past experience,” is more typically able to write
highly cited papers. The senior author may always have been good at producing
highly cited scientific output, but contrary to the case of the junior author,
she has the credentials to back it up.

Before we continue, we note that the choice of the logarithm function in
Eq. (2) is judicious. First, it implies that “the whole is not equal to the sum
of its parts” and is meant to capture at least some of the synergistic effects
of a collaboration (as suggested, for instance, by Figg et al. (2006)): in other
words, the relation between the number of authors and the resulting quality
of the paper is taken to be non-linear rather than linear. Here, we follow Ke
(2013) closely, but replace his “paper fitness” by our “author ability”. Ke’s
model is more general, but we do not want to proliferate the number of fitting
parameters needlessly. Second, since the value of $q$ may vary over several
orders of magnitude in typical cases (vide infra), the logarithm ensures a more
modest range for the regression. This said, Eq. (2) is obviously an Ansatz
chosen merely for its simple mathematical form rather than being based on some
underlying physical understanding of research production within collaborations.
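To make the non-linearity concrete, Eq. (2) can be exponentiated, which turns
the sum of logarithms into a product of abilities. The two-author numbers in the
example are made up purely for illustration.

```latex
q_\alpha = \exp\Bigl(\sum_{i=1}^{M_a} f_{\alpha i} \ln a_i\Bigr)
         = \prod_{i=1}^{M_a} a_i^{f_{\alpha i}},
\qquad \text{e.g. } a_1 = 3,\ a_2 = 4
\ \Rightarrow\ q_\alpha = 3 \times 4 = 12 \neq 3 + 4 .
```

In this form, an author with $a_i = 1$ (that is, $\ln a_i = 0$) leaves the
quality of a paper unchanged, while any $a_i > 1$ multiplies it.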

If, among themselves, $M_a$ authors have published exactly $M_a$ papers, Eq. (2)
forms a system of $M_a$ linear equations that can be solved, in principle, for
the unique set $\{a_i\}_{i=1}^{M_a}$ of author abilities if the determinant of
the square matrix

$$F = \begin{pmatrix} f_{11} & \cdots & f_{1 M_a} \\ \vdots & \ddots & \vdots \\ f_{M_a 1} & \cdots & f_{M_a M_a} \end{pmatrix} \tag{3}$$


is non-zero. Such a situation is a priori atypical, and the more common case
is where the number of papers, $M_p$, does not equal $M_a$. However, the methods
of statistical fitting (e.g., least squares) can still produce a set
$\{a_i\}_{i=1}^{M_a}$, which may be unique or not depending on the
circumstances. Hence, the proposed method may be seen as a regression analysis
for the unknown “author ability” underlying the production of quality scientific
papers. The method of least squares is the one which we will employ in this
work. It has two desirable properties: first, it is sensitive to outliers, and
thus to very productive or skilled researchers, a concern raised principally by
Egghe in his g-index (Egghe, 2006); second, it is numerically easier to handle
than, say, the least-absolute error. For clarity, we note that the error
function which we seek to minimize is the sum of the squared residuals:

$$R(\{a_i\}) = \sum_{\alpha=1}^{M_p} \left( \ln q_\alpha - \sum_{i=1}^{M_a} f_{\alpha i} \ln a_i \right)^2 \tag{4}$$


In a set of scientific papers, the quality, however defined, will exhibit a
distribution over the papers. The least-squares fitting of the set
$\{\ln a_i\}$ to the set $\{\ln q_\alpha\}$ may, if no further constraints are
present, lead to negative values in the former set. While this is reasonable
from a statistical point of view, it seems self-contradictory from a physical
point of view that the addition of an extra author to a paper may lead to a
decline in the quality of the resulting product. Therefore, in this paper we
always impose the extra condition $\ln a_i > 0$ for all $i$ in the author set.
The least-squares solution of Eq. (2) may then be found by, for instance,
iterative gradient minimization techniques.
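One simple iterative scheme of this kind is projected gradient descent, sketched
below in Python. This is an assumption-laden illustration rather than the Octave
routine used for the results in this paper: the function name, the fixed step
size, the zero starting point and the relaxation of the strict inequality to
$\ln a_i \ge 0$ are all choices made here for brevity.

```python
import math

def fit_log_abilities(F, ln_q, step=0.1, iterations=5000):
    """Minimise sum_alpha (ln_q[alpha] - sum_i F[alpha][i] * x[i])**2 with x[i] >= 0.

    x[i] plays the role of ln a_i. The fixed step must stay below
    1 / sigma_max(F)**2 for the iteration to converge; 0.1 suits the toy data
    below and is an assumption, not a general recommendation.
    """
    n_authors = len(F[0])
    x = [0.0] * n_authors
    for _ in range(iterations):
        # Residual of each paper: model value minus observed ln q.
        residuals = [sum(f * xi for f, xi in zip(row, x)) - target
                     for row, target in zip(F, ln_q)]
        # Gradient of the squared-residual sum with respect to each x[i].
        gradient = [2.0 * sum(row[i] * r for row, r in zip(F, residuals))
                    for i in range(n_authors)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(xi - step * g, 0.0) for xi, g in zip(x, gradient)]
    return x

# Hypothetical toy data: three papers by three authors with 6, 15 and 2 citations.
F = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
ln_q = [math.log(6.0), math.log(15.0), math.log(2.0)]
abilities = [math.exp(x) for x in fit_log_abilities(F, ln_q)]  # close to [2, 3, 5]
```

When the unconstrained solution would be negative for some author, the
projection pins that author at $\ln a_i = 0$ and the remaining abilities absorb
the residual.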

2.1. On the interpretation of the meaning behind the author ability variable 
From the purely mathematical perspective of author ranking, the condition
that $\ln a_i > 0$ is not strictly necessary and there would be some numerical
benefits for the solution of Eq. (2), were it to be relaxed. For one thing, the
residuals in the regression would be decreased. However, we stick to this
condition in this paper because we want to maintain at least some “physical”
connotation for the $a$-values. If we allow negative values for $\ln a$ in the
fitting, we basically say that adding an author to a collaborative work may lead
to a decrease in the resulting quality. However, assuming that the scientific
field in which the paper is produced is sufficiently rigorous to permit a
general consensus on the importance of results, it should be clear that such a
situation is only possible if the coauthors allow the quality to decline. What
would motivate the other authors to allow such a decline? In this paper, we work
with the basic theoretical assumption that all authors are rational agents that
seek to maximize the quality of their work. This is why the unreasonableness of
allowing negative values of $\ln a$ in the fitting becomes even greater in the
“hard sciences”, in which the consensus on the methods and results (for
instance, theorems and proofs in computer science and mathematics; quantitative
measurements and models in the natural sciences) that constitute a paper is
clear.

Nevertheless (anticipating our choice for measuring q in the next section), 
we note that while there is general support for the notion that the “quality” of 
a paper—when measured as the number of citations that it accrues—benefits 
from the work of additional authors (Figg et al., 2006; Bornmann and Daniel,
2007a; Lokker et al., 2008), Waltman and van Eck (2015)
find a very slight detrimental effect on the citation counts of papers with three, 
four or five authors with respect to papers authored by two authors (they are 
still cited substantially more than papers by a single author). For six and more 
authors, an unequivocal benefit is seen. Their analysis is based on an average of 
field-normalized citation scores across all the disciplines in the Web of Science 
database and seems to indicate, at first glance, that contrary to our assumption 
additional authors may have a detrimental effect on the quality of a joint paper. 

While the results of Waltman and van Eck (2015) merit more careful scrutiny 
and an analysis broken down by scientific fields, one possible reason for this 
apparent average decline in quality with additional authors could be that larger 
collaborations tend to split work over several different papers, a strategy with 
a known benefit (Bornmann and Daniel, 2007a), to a greater extent than the 
author pair. In this case, the total citation count of that group of coauthors 
should be the sum over their joint papers. We shall correct for this eventuality 
in our analysis (vide infra) by multiplying the author abilities by the number
of coauthored papers. However, if the motive were simply to minimize the
residuals in the fitting, a more malleable model with more fitting parameters
would be appropriate. Using such a strategy, the residuals can be made to
disappear completely but at the same time, the validity of the extracted
parameters is decreased. Nevertheless, at the express insistence of one of the
Reviewers, results analogous to those given in the next section are provided in
the Appendix.
3. Illustrative real-world example

For purposes of illustration, we take the variable $q_\alpha$ to correspond to
the number of citations of paper $\alpha$. We will then rank authors, not by
$a_i$ directly however, because that will give undue weight to the average
performance of an author, but rather by $n_i a_i$, where $n_i$ is the number of
papers to which author $i$ has contributed in the statistical sample. In this
way, we hope to cover both the “breadth” and “depth” of an author’s output. As
the starting point for the iterative solution of Eq. (2), we take the fractional
number of citations per paper for each author $i$. All numerical calculations
were performed using the GNU Octave (Eaton et al., 2009) software, version
3.8.1.
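The two bookkeeping quantities mentioned here can be sketched in a few lines of
Python. The sketch reflects one plausible reading of “fractional number of
citations per paper” (each paper’s citations split equally among its authors,
summed per author, then divided by that author’s paper count); the function
names, the toy data and the ability values are hypothetical.

```python
def fractional_citations(F, q):
    """Split each paper's citations equally among its authors and sum per author."""
    totals = [0.0] * len(F[0])
    for row, citations in zip(F, q):
        share = citations / sum(row)  # assumes every paper has at least one author
        for i, f in enumerate(row):
            totals[i] += f * share
    return totals

def papers_per_author(F):
    """Return n_i, the number of papers to which each author has contributed."""
    return [sum(column) for column in zip(*F)]

def rank_by_na(authors, n, abilities):
    """Order authors by the product n_i * a_i, largest first."""
    scores = {name: k * a for name, k, a in zip(authors, n, abilities)}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three papers by authors A, B, C with 6, 15 and 2 citations.
F = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
q = [6, 15, 2]
frac = fractional_citations(F, q)            # [5.0, 10.5, 7.5]
n = papers_per_author(F)                     # [2, 2, 1]
start = [c / k for c, k in zip(frac, n)]     # [2.5, 5.25, 7.5], a starting guess
ranking = rank_by_na(["A", "B", "C"], n, [2.0, 3.0, 5.0])  # ['B', 'C', 'A']
```

In the toy ranking, author C has the largest ability but only one paper, so the
product $n_i a_i$ places B first.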

The statistical basis for this non-exhaustive study was obtained from the
CiteSeerX online database³ by compiling the cited papers⁴ of renowned computer
scientists Thomas H. Cormen⁵ and Charles E. Leiserson⁶ and their immediate
coauthors.⁷ This search yielded data for 1228 publications by a total of
1416 authors, after some manual pruning for author name variations where
ambiguity was not an issue (e.g. “James” or “Jim”) and also for some
transcription errors in the database (e.g. part of the title of the paper or
author information [affiliation, etc.] contaminating an author name). However,
of these authors, 856 appear on only one paper each in the dataset and were
excluded from the regression analysis. This increases the robustness of the
results, as any statistical method is only reliable if there are repeated
occurrences in the dataset. No correction for “inseparable coauthors” (authors
who invariably publish together) was made in the analysis, as such groups are
indistinguishable from a single author in output and citation data and so cannot
be mathematically disentangled. The frequency distributions for the number of
times a document or an author is cited are given in Figure 1 and are seen to
exhibit the heavy tail typical of citation distributions (Egghe, 1998). The
statistical basis should be sufficient for our purposes.
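The exclusion of authors who appear only once (which leaves 1416 − 856 = 560
authors in the regression) corresponds to dropping columns of the authorship
matrix. The Python sketch below shows one way to do this; the function name and
toy data are hypothetical, and how papers whose authors are all removed should
be treated is not specified here and is left out of the sketch.

```python
def drop_single_paper_authors(F, authors, min_papers=2):
    """Keep only the columns of F belonging to authors with at least min_papers papers."""
    counts = [sum(column) for column in zip(*F)]
    keep = [i for i, count in enumerate(counts) if count >= min_papers]
    reduced = [[row[i] for i in keep] for row in F]
    return reduced, [authors[i] for i in keep]

# Hypothetical toy data: author C appears on a single paper and is dropped.
F = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
reduced, kept = drop_single_paper_authors(F, ["A", "B", "C"])
# reduced == [[1, 1], [0, 1], [1, 0]], kept == ['A', 'B']
```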

³ http://citeseerx.ist.psu.edu, accessed February, 2015.

⁴ We limit our study to cited papers, not out of theoretical necessity, but out
of practical convenience.

⁵ Search query: author:"thomas+h+cormen"

⁶ Search query: author:"charles+e+leiserson"

⁷ Search queries generated automatically by a script on the same model as used
for Cormen and Leiserson.

A least-squares regression analysis was performed on the data to yield a set
of unique author abilities $\{a_i\}$. The values for $n_i a_i$ range from 2 to
almost 1000; the distribution is visualized in Figure 2. Evidently, the shape of
the distribution of the $na$-values is reminiscent of those of the paper and
author citations: most authors are of “ordinary” ability and not easily
distinguishable. The author with the highest $na$-value (and, incidentally, also
the highest $a$-value) in the dataset turns out to be renowned cryptologist
Ronald L. Rivest (known for the RSA cryptosystem). He is, however, not the most
productive author in the dataset, having fewer papers than David Kotz; he does,
on the other hand, have more citations than Kotz and so would rank higher also
in most classical rankings. The top-ten ranked authors are given in Table 1 with
some bibliometric data from the dataset. The $na$-ranking of the top ten follows
that of the total number of citations closely, but with some notable exceptions:
Sivan Toledo, David M. Nicol, Michael A. Bender and Robert D. Blumofe all
obtain a higher ranking under the $na$-system than they would by just counting
total citations. Conversely, Satish Rao, Benny Chor and C. Greg Plaxton obtain
lower rankings under the $na$-system than they would by total citations.

Figure 2: The distribution of $na$ (rounded to integer values) obtained from the
regression analysis.

Table 1: Number of publications ($n$), $na$-value, total and fractional number
of citations (with or without authors included that only appear once) as well as
the h-index ($h$) for the ten top-ranked authors in the dataset according to
$na$-value. The value of $na$, as well as that of the fractional citation count,
is rounded to the nearest integer. The Pearson correlation coefficient between
$na$ and the total citation count in this table is $r = 0.95$; between $na$ and
the fractional citation count in this table, it is $r = 0.94$ if authors who
only appear once are excluded and $r = 0.97$ if they are included.

| Author            | $n$ | $na$ | Citations | Frac. cit. (a) | Frac. cit. (b) | $h$ |
|-------------------|-----|------|-----------|----------------|----------------|-----|
| Ronald L. Rivest  | 102 | 957  | 9524      | 6531           | 3766           | 31  |
| David Kotz        | 145 | 613  | 3987      | 1900           | 1769           | 32  |
| Guy E. Blelloch   | 71  | 398  | 2006      | 997            | 929            | 23  |
| Robert D. Blumofe | 13  | 321  | 1780      | 963            | 600            | 11  |
| Michael A. Bender | 59  | 317  | 1409      | 583            | 496            | 19  |
| David M. Nicol    | 68  | 270  | 856       | 384            | 319            | 17  |
| Satish Rao        | 51  | 260  | 1964      | 834            | 664            | 22  |
| Sivan Toledo      | 60  | 251  | 994       | 638            | 557            | 17  |
| Benny Chor        | 41  | 215  | 1824      | 793            | 590            | 18  |
| C. Greg Plaxton   | 46  | 189  | 1857      | 771            | 595            | 17  |

(a) Authors who only appear once not counted.
(b) Authors who only appear once counted.

The correlation between the integer citation count and the $na$-values apparent
from Table 1 is slightly stronger when the fractional citation count including
all authors is substituted for the integer one. This is actually a surprising
result since the $na$ values are calculated from a sample from which authors who
only appear once have been removed. The strong correlations are, nevertheless,
somewhat attenuated when the whole data sample is considered instead of only the
most outstanding authors: the Pearson correlation coefficient between $na$ and
the total citation counts for the whole dataset is $r = 0.89$; and between $na$
and the fractional citation counts, it is either $r = 0.89$ (excluding authors
who only appear once) or $r = 0.92$ (including authors who only appear once).
However, perhaps more interesting for the purposes of author ranking is the rank
correlation. The Spearman rank correlation between the fractional citation count
and the $na$ values is $\rho = 0.79$ (when rounded to two decimal places, the
result is the same whether or not authors who only appear once in the dataset
are excluded from the denominator), which is slightly stronger than the
corresponding rank correlation of $\rho = 0.70$ with the total citation count.
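For readers who wish to reproduce correlation figures of this kind, a minimal
Python sketch of the Pearson and Spearman coefficients follows. The ten value
pairs are the $na$ and total-citation columns of Table 1; the function names are
ours, and the Spearman coefficient is computed as the Pearson coefficient of
average ranks, which is the standard definition but not necessarily the exact
routine used for the numbers quoted in this paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# na-values and total citation counts of the ten authors in Table 1.
na = [957, 613, 398, 321, 317, 270, 260, 251, 215, 189]
citations = [9524, 3987, 2006, 1780, 1409, 856, 1964, 994, 1824, 1857]
r_table = pearson(na, citations)  # rounds to 0.95, as stated in the caption
```

Note that the whole-dataset coefficients quoted in the text cannot be recovered
from the ten rows of Table 1 alone; only the in-table value of $r$ can be
checked in this way.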

4. Concluding discussion 

While the $na$-ranks agree rather well with traditional measures of high-level
scientific productivity, contrary to the traditional approach, which is purely
ad hoc, the proposed model of this paper is based on the assumption that the
underlying scientific productivity is governed by a factor that can be estimated
from regression analysis. Arguably, the age-old adage “practice makes perfect”
is likely to hold true to some extent also when performing scientific research
and writing scientific papers, but in the interest of keeping the unknown
parameters to a minimum, we have not considered this effect in our model.
Nevertheless, the results support the view that fractional citation counting is
a fair way to distribute credit, at least within the computer science field. In
line with this finding, it is important to stress that, the strong rank
correlation between the $na$-values and citation counts (fractional or
otherwise) notwithstanding, the idea in this paper is not to introduce a more
“expensive” method to calculate the citation ranks. It is only the differences
with respect to the traditional ranking that are interesting, because they show
precisely the extent to which there is a need to step away from the simplified
author ranking for purposes of promotion and funding.
It is interesting to compare the proposed method with that of Tol (2011),
seeing as it is the one with which it shares the most of the undergirding
philosophy. Contrary to Tol (2011), there is no need to assume any form for the
citation distribution. Since Tol (2011), implicitly at least, assumes an
unvarying distribution for each author,⁸ his model is also based on the concept
of an unchanging, inherent “author ability” that is used to produce cited
papers. The proposed method is hence seen to be more general in its assumptions.
For instance, the “ability” to publish pages of scientific output could just as
well be the underlying variable that we wish to extract statistically; i.e., the
bibliometric indicator could be the number of pages per paper instead of
citations. The idea is that one first identifies a measure of quality ($q$) for
the individual paper, and then proceeds to analyze the underlying distribution
of the authors’ abilities ($a$).

Note that one of the basic ideas in the approach of Shen and Barabasi (2014),
namely to distinguish coauthor disciplines through their degree of “cocitedness”
with other papers (essentially distinguishing scientific disciplines by the sets
of papers that cite a particular paper), is easily adapted to the current
algorithm. One needs simply to redefine the quantity $q$ accordingly by, for
instance, defining $q_\alpha$ to be a weighted sum of citations, in which the
weight of a citation to paper $\alpha$ from paper $\beta$ is determined by the
“cocitation strength” (Shen and Barabasi, 2014) between papers $\alpha$ and
$\beta$, i.e., the number of papers citing both $\alpha$ and $\beta$. This is an
interesting avenue for further development.
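A rough Python sketch of such a redefined quality measure is given below. It
takes the description above literally (each citing paper contributes the number
of papers that cite both it and the cited paper) and should be read as a
hypothetical illustration: the function name and the `cited_by` data structure
are assumptions, and this is not the algorithm of Shen and Barabasi (2014)
itself.

```python
def cocitation_weighted_quality(cited_by):
    """Return a weighted citation sum for every paper.

    cited_by maps each paper to the set of papers that cite it. A citation to
    paper alpha from paper beta is weighted by the number of papers citing both
    alpha and beta (their co-citation strength).
    """
    quality = {}
    for alpha, citers in cited_by.items():
        quality[alpha] = sum(
            len(citers & cited_by.get(beta, set())) for beta in citers
        )
    return quality

# Hypothetical toy citation data: Q and R cite P, and R also cites Q.
cited_by = {"P": {"Q", "R"}, "Q": {"R"}, "R": set()}
quality = cocitation_weighted_quality(cited_by)  # {'P': 1, 'Q': 0, 'R': 0}
```

Papers whose weighted sum is zero would need separate treatment before entering
Eq. (2), since the logarithm of zero is undefined.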

Finally, I stress once more that in some extreme cases, individual author 
abilities cannot be distinguished even in principle. This occurs, for instance, 
when two authors are “inseparable coauthors”, and the one never publishes 
a paper without the other. This problem is, however, endemic to the whole 
domain of citation analysis and becomes less of an issue in practice as the 
seniority of an author increases. 
Appendix A. Regression with “destructive authors” in the dataset

If we relax the requirement that $\ln a_i > 0$ for any author $i$, we
effectively assume that said author $i$ may be a “destructive force” which,
unbeknownst to his coauthors and himself, sabotages the paper they produce. For
completeness, we provide the resulting “top ten” authors using this assumption
in Table A.2. This provides an indirect measure of the robustness of the method.


⁸ The distributions that Tol (2011) considers change through the iterations used
to solve the model, but the converged result is a function, like the $a$-value,
only of the bibliographic record and does not change for one and the same author
from one paper to the next.
Table A.2: Number of publications ($n$), $na$-value and total number of
citations for the ten top-ranked authors in the dataset according to $na$-value
when authors are allowed to have $a$-values less than unity in the statistical
fitting. The value of $na$ is rounded to the nearest integer.

| Author              | $n$ | $na$ | Citations |
|---------------------|-----|------|-----------|
| Ronald L. Rivest    | 102 | 1272 | 9524      |
| David Kotz          | 145 | 957  | 3987      |
| Guy E. Blelloch     | 71  | 776  | 2006      |
| James Demmel        | 7   | 725  | 631       |
| Marc Moreno Maza    | 45  | 618  | 514       |
| Michael A. Bender   | 59  | 428  | 1409      |
| Sivan Toledo        | 60  | 374  | 994       |
| David M. Nicol      | 68  | 339  | 856       |
| Anastassia Ailamaki | 4   | 337  | 116       |
| Robert D. Blumofe   | 13  | 333  | 1780      |


As before, the top two spots are still claimed by Rivest and Kotz (although
their $na$-values are now higher, for obvious reasons). With the exception of
Demmel, Maza and Ailamaki, all of the top ten names appear also in Table 1,
indicating only a slight reordering. The rank correlations between the
$na$-values and the number of citations are $\rho = 0.67$ (total), $\rho = 0.73$
(fractional with all authors) and $\rho = 0.74$ (fractional excluding one-time
authors) in the whole dataset. The corresponding Pearson correlation
coefficients are $r = 0.78$, $r = 0.81$ and $r = 0.78$, respectively. Thus, even
with this “unphysical” assumption, we see a stronger correlation with fractional
citation counts than with the integer one.